
proton#15

Closed
gnurizen wants to merge 14 commits into main from proton

gnurizen commented Apr 9, 2026

  • Rewrite parcagpu to use Proton's CUPTI infrastructure
  • Update proton callback API names for upstream sync
  • Update proton submodule to latest upstream sync
  • Replace interval-based rate limiter with token bucket algorithm
  • Add activity_batch USDT probe and fix test infrastructure
  • Various fixes to make arm64 work
  • And make amd64 compile too
  • Small cleanups/formatting
  • Shorten names
  • Checkpoint PC sampling tweaking
  • Stall reason map handling, prepping for batched pc samples
  • Flesh out cubin processing, SASS lookup and PC sampling probe batching
  • PC sampling: probabilistic windowed start/stop with KERNEL_SERIALIZED mode
  • Cleanup related to usdt/cupti extraction

gnurizen and others added 14 commits March 23, 2026 16:42
Major changes:
- Use Proton as a git submodule for CUPTI callback handling
- Rewrite in C++ using Proton's CuptiApi and callback patterns
- Add PC sampling support for continuous GPU profiling
- Simplify build to single library (works with any CUDA version at runtime)
- Use CMake build system
- Consolidate GitHub workflows into single build.yml
- Update Dockerfile to Ubuntu 24.04 (fixes USDT probe generation)

The library now uses Proton's dynamic CUPTI loading, so a single build
works with CUDA 12.x and 13.x at runtime.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
setDriverCallbacks was renamed to setLaunchCallbacks in upstream Proton.

The simple 500μs interval check capped throughput at 2000 samples/sec
regardless of actual load. The token bucket (configurable via
PARCAGPU_RATE_LIMIT, default 100/sec) smooths bursts while
maintaining a predictable average rate.
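
A minimal sketch of the token bucket idea described above, for readers unfamiliar with it. The class name, the burst parameter, and the injected clock argument are assumptions made for illustration; they are not taken from parcagpu's source.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical token-bucket limiter (illustrative names, not parcagpu's).
// Tokens refill continuously at `rate_per_sec` up to `burst`; each event
// that passes consumes one token.
class TokenBucket {
public:
  using Clock = std::chrono::steady_clock;

  TokenBucket(double rate_per_sec, double burst,
              Clock::time_point now = Clock::now())
      : rate_(rate_per_sec), capacity_(burst), tokens_(burst), last_(now) {
    assert(rate_per_sec > 0.0 && burst >= 1.0);
  }

  // Returns true and consumes a token if an event may pass at time `now`.
  bool tryAcquire(Clock::time_point now = Clock::now()) {
    const std::chrono::duration<double> elapsed = now - last_;
    if (elapsed.count() > 0.0) {
      // Refill in proportion to elapsed time, capped at the burst size.
      tokens_ = std::min(capacity_, tokens_ + elapsed.count() * rate_);
      last_ = now;
    }
    if (tokens_ < 1.0) return false;
    tokens_ -= 1.0;
    return true;
  }

private:
  double rate_;      // tokens added per second
  double capacity_;  // maximum stored tokens (burst size)
  double tokens_;    // tokens currently available
  Clock::time_point last_;
};
```

Unlike a fixed minimum interval between events, the bucket lets a short burst through immediately (up to the burst size) and then settles to the configured average rate.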

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add parcagpuActivityBatch() probe that fires with batches of up to 128
activity record pointers, enabling BPF consumers to read kernel timing
data directly from CUPTI buffers without per-record probe overhead.
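
The batching described above might look roughly like the following sketch. The 128-record limit comes from the commit message; the function name and the std::function callback standing in for the USDT probe call site are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Batch size stated in the commit message.
constexpr std::size_t kMaxBatch = 128;

// Stand-in for the USDT probe call site (hypothetical signature).
using BatchProbe =
    std::function<void(const void* const* records, std::size_t count)>;

// Fires `probe` once per chunk of at most kMaxBatch record pointers.
// Returns the number of times the probe fired.
std::size_t emitInBatches(const std::vector<const void*>& records,
                          const BatchProbe& probe) {
  assert(static_cast<bool>(probe));
  std::size_t fired = 0;
  for (std::size_t off = 0; off < records.size(); off += kMaxBatch) {
    const std::size_t n = std::min(kMaxBatch, records.size() - off);
    probe(records.data() + off, n);
    ++fired;
  }
  return fired;
}
```

With 300 records this fires three times (128, 128, 44) instead of 300 times, which is the per-record overhead the probe is meant to avoid.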

Build/test changes:
- Link test binary against mock CUPTI/CUDA with --no-as-needed so
  Proton's dlopen(RTLD_NOLOAD) finds the mocks at runtime
- Fix make test to run the test binary directly with LD_LIBRARY_PATH
  (ctest had no tests registered)
- Add make bpf-test and make test-multi targets for BPF activity
  parser integration testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… mode

Implement interval-gated probabilistic PC sampling that only serializes
kernels during active sampling windows, not for the entire process lifetime.

Architecture:
- CUPTI lifecycle: enable (once) → start/stop (per window) → disable (once)
- Enable START_STOP_CONTROL attribute so start/stop work from CUPTI callbacks
- Collection mode is KERNEL_SERIALIZED for per-kernel correlation
- Probabilistic window: every PARCAGPU_PC_SAMPLING_INTERVAL seconds, roll a
  PARCAGPU_PC_SAMPLING_PROBABILITY die; if it hits, start sampling until the
  window closes, then stop and drain data
- start()/stop() are mutex-guarded and idempotent (no double-start/stop races)
- ctxSynchronize before start to satisfy CUPTI's GPU-idle requirement
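
The window logic in the list above can be sketched as a small state machine. This is an illustration under assumptions, not the code in cupti.cpp: the class name, the tick() entry point, and the start/stop callbacks (standing in for the CUPTI start/stop calls and the preceding context synchronize) are all hypothetical, and the window length is assumed to equal the roll interval.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <random>
#include <utility>

// Hypothetical windowed sampler. At each interval boundary it closes any
// open window, then rolls a die to decide whether to open the next one.
class SamplingWindow {
public:
  using Clock = std::chrono::steady_clock;

  SamplingWindow(double probability, Clock::duration interval,
                 std::function<void()> start, std::function<void()> stop,
                 Clock::time_point now, unsigned seed = 42)
      : probability_(probability), interval_(interval),
        start_(std::move(start)), stop_(std::move(stop)),
        next_roll_(now), rng_(seed) {
    assert(probability >= 0.0 && probability <= 1.0);
    assert(interval > Clock::duration::zero());
  }

  // Intended to be called from callback entry/exit points.
  void tick(Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mu_);
    if (now < next_roll_) return;  // still inside the current interval
    stopLocked();                  // close the previous window, if any
    next_roll_ = now + interval_;
    if (dist_(rng_) < probability_) startLocked();
  }

  bool active() const {
    std::lock_guard<std::mutex> lock(mu_);
    return active_;
  }

private:
  // Both helpers are idempotent: a second start or stop is a no-op.
  void startLocked() {
    if (active_) return;
    start_();  // real code would synchronize the context first
    active_ = true;
  }
  void stopLocked() {
    if (!active_) return;
    stop_();  // real code would also drain collected samples here
    active_ = false;
  }

  const double probability_;
  const Clock::duration interval_;
  std::function<void()> start_;
  std::function<void()> stop_;
  Clock::time_point next_roll_;
  std::mt19937 rng_;
  std::uniform_real_distribution<double> dist_{0.0, 1.0};
  mutable std::mutex mu_;
  bool active_ = false;
};
```

Keeping the active flag and both transitions under one mutex is what makes concurrent callbacks unable to double-start or double-stop a session in this sketch.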

Key changes:
- pc_sampling.cpp: Session-based enable with per-window start/stop, semaphore-
  gated stall reason map replay (replaces rate-limited emission), CUPTI 12.4
  ABI version check (v22 correlationId boundary), graceful permission failure
  handling in enable
- cupti.cpp: Probabilistic window state machine in ENTER/EXIT callbacks,
  env var config (probability, interval), env_config validation
- probes.d: Add error USDT probe for surfacing CUPTI failures to BPF
- test/mock_cupti.c: Full PC sampling mock with real cubin from pc_sample_toy,
  real SASS offsets for source-line correlation, 11-entry sample table cycling
  through shmem_bounce/hash_churn/trig_storm kernels
- test/mock_cuda.c: Add cuCtxSynchronize stub
- test/test-pc-mock.sh: New GPU-less test using mock libs and real cubin
- test/test-pc-real.sh: Set probability=1 interval=0.5 for reliable test hits
- test/bpf/: Move CUPTI struct defs to shared cupti_bpf.h, add error event
  handling, CUDA 12.4+ correlationId support
- test/CMakeLists.txt: Build mock CUDA driver library
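
For the env var validation mentioned under cupti.cpp, a sketch of one possible approach follows. Only the variable names PARCAGPU_PC_SAMPLING_PROBABILITY and PARCAGPU_PC_SAMPLING_INTERVAL come from this PR; the helper name, the fallback values, and the accepted ranges are assumptions.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstdlib>

// Hypothetical helper: reads a floating-point env var and returns
// `fallback` if it is unset, empty, malformed, or outside [lo, hi].
double envDouble(const char* name, double fallback, double lo, double hi) {
  assert(name != nullptr && lo <= hi);
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return fallback;

  errno = 0;
  char* end = nullptr;
  const double v = std::strtod(raw, &end);
  // Reject overflow, no digits consumed, or trailing garbage.
  if (errno != 0 || end == raw || *end != '\0') return fallback;
  if (!std::isfinite(v) || v < lo || v > hi) return fallback;
  return v;
}

// Example use with assumed defaults and ranges (not the project's):
//   double p = envDouble("PARCAGPU_PC_SAMPLING_PROBABILITY", 1.0, 0.0, 1.0);
//   double s = envDouble("PARCAGPU_PC_SAMPLING_INTERVAL", 1.0, 0.001, 3600.0);
```

Falling back to a default on bad input, rather than aborting, keeps a typo in the environment from taking down the profiled process.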

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
USDT headers now live in parca-dev/usdt and the CUPTI BPF
headers now live in this project, so we no longer need to vendor
otel.

gnurizen commented Apr 9, 2026

I'm gonna squash first and redo this

@gnurizen gnurizen closed this Apr 9, 2026